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Systematic inaccuracy is inherent in any computational estimate of a non-linear average, due 
to the availability of only a finite number of data values, TV. Free energy differences A_F between 
two states or systems are critically important examples of such averages in physical, chemical and 
biological settings. Previous work has demonstrated, empirically, that the "finite-sampling error" 
can be very large — many times ksT — in AF estimates for simple molecular systems. Here, we 
present a theoretical description of the inaccuracy, including the exact solution of a sample problem, 
the precise asymptotic behavior in terms of 1/7V for large A*', the identification of universal law, and 
numerical illustrations. The theory relies on corrections to the central and other limit theorems, 
and thus a role is played by stable (Levy) probability distributions. 



Introduction. Free energy difference calculations have a tremendous range of applications in physical, chemical, 
and biological systems; examples include computations relating magnetic phases, estimates of chemical potentials, 
and of binding affinities of ligands to proteins (e.g., Since the work of Kirkwood 0|, it has been appreciated 

that the free energy difference, AF = Ai<o^i, of switching from a Hamiltonian Tio to Hi is given by a non-linear 
average, 



AF = -fcsT log [ ( exp (-W^o^i/fesT) )o ] . 



(1) 



where ksT is the thermal unit of energy at temperature T and Wq^i is the work required to switch the system 
from Tig to Tii. The angled brackets indicate an average over switches starting from configurations drawn from the 
equilibrium distribution governed by Hq. In instantaneous switching the work is defined by Wo^i = 7ii(x) — 7io(x) 
for a start (and end) configuration x; however, gradual switches requiring a "trajectory" -based work definition may 
also be used as was pointed out by Jarzynski HJ^. 
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FIG. 1. Finite-sampling error for Gaussian-distributed work values. The expected value of the dimensionless finite-sampling 
inaccuracy, (AF„ — AF)/fcflT for n data points is plotted as a function of 1/n. From top to bottom, the data sets represent 
numerical values of the error for Gaussian distributions of work values with standard deviations, (Tiu/fcaT of 3, 2, 1.5, and 1. 
The right panel also shows the exact, asymptotic linear behavior for the smallest widths. 



Whenever a convex, nonlinear average such as (|^) is estimated computationally, that result will always be systemat- 
ically biased because one has only a finite amount of data — say, N work values. The bias results from incomplete 
sampling of the smallest (or most negative) Wq^i values: these values dominate the average (|l|) and cannot be sam- 
pled perfectly for finite N, regardless of the Wo-+i distribution. Thus, a running estimate of AF will typically decline 
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as data is gathered. Such considerations led Wood et al. ||Tl[ to consider the block-averaged n-data-point estimate of 
the free energy based on = mn total work values {VK*^*^^ namely, 



AF^ = - V-fcsTlog 



- E 



exp 



k={j-l)n+l 



(2) 



In the limit m, N oo, AF^ is mathematically well-defined and amenable to analysis; it represents the expected 
value of a free energy estimate from n data points — that is, of 



(3) 



See Fig. |l]. Wood et al. estimated the lowest order correction to AF = AFoo as a'^/2nkBT, where cr^ is the variance 
in the distribution of work values, W Ferrenberg, Landau and Binder discussed analogous issues for the magnetic 
susceptibility 

More recently, Zuckerman and Woolf [H) suggested a means by which a range of AFn values for n < could be 
used to extrapolate to the true, infinite-data answer, AF. The authors also observed that, for large m — N/n, the 
free energy is bounded according to 



AF < AFn , any n . 



(4) 



This inequality results from the convexity of the exponential function, as will be demonstrated explicitly in a fuller 
account of the theory. Finally, Zuckerman and Woolf noted that the leading behavior of AF„ appeared to be not 
always linear in 1/n but, rather, seemed to behave as (l/n)"^^ for ri < 1. 

This Letter presents the theory — apparently for the first time — describingthe finite-sampling inaccuracy for AF 
estimates. Previous work discussing AFn has been, primarily, empirical |]lT| , |l3| . Our report includes (i) the formal 
analytic expression for the expected value of the error from N work values, AFn ~ AF; (ii) an exact solution, for all N, 
of this expected value when the Boltzmann factor of the work value z = e-^/><^BT follows a gamma distribution; (iii) 
exact asymptotic expressions for AFn a-nd the variance of as n ^ oo for arbitrary W distribtions, including non- 
analytic behavior in the case when the variance and higher moments of z diverge; and (iv) discussion and numerical 
illustrations based on Gaussian distributions of W, plus corrections expected from skewed Gaussian distributions. The 
pre sent discussion makes use of mathematical results regarding the convergence — to "stable" limiting distributions 
|p^-|l6|, also known as Levy processes (e.g., — of the distributions of sums of variables. The results are expected 
to have practical application in the extrapolation process outlined in [|l3| . 

Formal Development of AFn- The derivation proceeds via continuum expressions simplified by the definitions 
w = W/kBT, f = AF/ksT, and /„ = AFn/ksT. First, in terms of the probability density of work values, which 
is normalized by J dwpwiw) = 1, the free energy is given by the continmmr analog of (Q), 



/ = AF/keT = - log 



dw Pw{w) e 



(5) 



The finite-data average free energy, following (g) must apply the logarithm "before" the average of the n Boltzmann 
factors, and one has 



/n " 1 " 

Y\ [dWi Pw{wi)] log - ^ 
A — 1 ^ A — 1 



(6) 



Now, motivated by the central and related limit theorems ||T^,|l^,|T^ for the sum of the e ™ variables, we introduce 
a change of variables which will permit the development of a 1/n expansion for /„. In particular, we define 



y 



(e"'"! + • • • + e-"'" - ne-f) / bin^/" , 



(7) 



where bi is a constant and a < 2 is an exponent characterizing the distribution of the variable e ™. In fact, the 
requirement that A_F be finite in (0) further implies a > 1. The finite-data free energy difference can now be written 



In — 



j dyp„(2/)log^e ^ + ^yj 



(8) 
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where c = exp {—f)/bi, a = {a—l)/a < 1/2, and p„ is the probabihty density of the variable y. Note that a is always 
positive because a > 1. 

To continue, we must call upon some mathematical results regarding the approach, with increasing n, to general 
stable limit distributions (of which the Gaussian, for a = 2, is the best known [[l4yi6[ ). More precisely, the sum of 
any set of random variables, suitably normalized as in (^, has a distribution with zero mean which may be expressed 
as a stable distribution function multiplied by a large-n asymptotic expansion ||l^ , [l9| . 

Finite-Moments Case and An Exact Solution. To illustrate the case of a Gaussian limit (a = 2), assume the 
variable e^™ possesses finite "Boltzmann moments" — a mean p, = e~-^ , variance ct^, and third moment jl^ — not to 
be confused with the moments of the distribution of w. The finite-n corrections to the central limit theorem indicate 
that the variable y = (^" e"™' — nfl)/y/na [cf. (0)] is distributed according to Q 

Pn(y) = pg(";1) [i + My)/V^ + ''2{y)/n + ---] , (9) 

for large n, where the remaining terms are higher integer powers of l/\/ri and the Gaussian density is 

poiy; <J) = exp {-yy2<j^)/V2^a , (10) 



The i^i depend on the original distribution of e^"^; for instance, I'liy) — {jl^/Qa^){y^ — iy) |14|. Moreover, the v 
functions are odd or even according to whether i is odd or even, in this a = 2 case. 

One arrives at the explicit form of the finite-data-corrected free energy for the case of finite &^ and fi^ by substituting 
(^ into (^, along with an expansion of the logarithm about y = 0. (More careful consideration of series convergence 
for large y yields the same final result for /„, as will be elucidated in future work.) Because of the odd- and even-ness 
of the factors to be integrated, one finds an expansion consisting solely of integer powers of 1/ri, namely, 

fn^ f + fi/n + Lp2/n^ + , (11) 

with If I — (T^/2/i^ and ip2 = ~(4/iA3 ~ %cr^)/12fL^ . To compare this with the finding of Wood et al. for /„ — /, one 
can consider a Gaussian distribution of = kBTw with variance ct^: expanding the resulting Boltzmann moments of 
previous result for small yields ifi = kBT[exp [(cr^/ZcsT)^] — 1]/2 « a'^,/2kBT, which yields precisely the first-order 
precdiction of Wood et al. ||ll| . 

Figure |l| illustrates the behavior of the finite-data free-energy for a Gaussian distribution of work values, based 
on numerical block averages (|^) and the asymptotic behavior given in (^ij). Although the leading term in /„ — / is 
linear in 1/n, the leading coefficient is exponential in the square of the distribution's width, while the next coefficient 
depends on the cube of the width. The asymptotic expressions (ll|) thus represent viable approximations only for a 
very small window about 1/n = for large widths. Fig. |l| shows that such behavior is easily mistaken for non-analytic 
(e.g., power-law) behavior. 

An exactly solvable case occurs when the Boltzmann factor e""" = z is distributed according to a gamma distribu- 
tion, namely. 



pv{z- b, q) - {z/by exp i-z/b) I bV{q) . (12) 

Because this density is "infinitely divisible" (see, e.g., [Q) the required sums in (^ also follow gamma distributions, 
and after performing the integration described in dq), one finds 

fnin:,b,q) ^ log [n/b) - tpinq) (13) 

where the digamma function is defined by tp{x) — {d/dx)r{x). The exact solution is illustrated in Fig. |2|for 6 = 10, 
q = 2. 

When asymmetry is added to a Gaussian distribution via the first Edgeworth correction (see, e.g., jl^]), one finds 
that the exponential dependence of the ipi on is only corrected linearly by the now non-zero third moment of the 
W distribution. 

Divergent Moments Case. When the variable e^"-' = z in (|^) possesses a long-tailed distribution p^, the limiting 
distribution is not a Gaussian and the results (^ and (|ll|) no longer hold. In particular, if one of the tails of Pz{z) 
decays as z^(i+") with a < 2 (implying an infinite Boltzmann variance, ct^), then the distribution of the variable y 
in (0) approaches a non-Gaussian stable law for large n jlj]. Note that such power-law behavior in z corresponds 
to simple exponential decay in the work distribution. Further, because the mean of e^"" must be finite for AF to 
exist [recall (ph] , we also have a > 1 . Unfortunately, no explicit forms for stable distributions are known in the range 
1< a< 2 |l§. 
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A long-tailed z distribution = p\ also alters the iorm of the asymptotic expansion of the sum-variable y dis- 
tribution and, hence, the expansion of /„. Instead of the distribution of the variable y (0) now takes the more 
complicated form [|9| 

Pn(y) = Po.{y) [i + Y^VuMln'^'''^'^\ , (14) 

where pa is the appropriate stable probability density with exponent a. The functions v^v, which are not available 
analytically, depend on the original distribution of e"*" and partial derivatives of the stable distribution. The exponents 
are given by 9{u,v) = {u + av)/a, and the summation ^* includes m > and v > —\u/2], where [x] denotes the 
integer part of x. Note that we have omitted an asymmetry parameter, /3 = 1, of the stable laws [ p6| which will be 
discussed in future work; it does not, however, affect the form of the expansions. 

Development of the expansion of /„ for large n in the case of diverging Boltzmann moments is more complicated, 
and will only be sketched here. The basic strategy is to ensure that the coefficients of the powers of 1/n are all 
rendered in terms of convergent integrals, which requires both an expansion of the logarithm in (H), as well as series 
and asymptotic expansions of p^ in (|l^) available from p"^-p^ . The asymptotic result for n oo takes a reasonably 
simple form, namely, 

/„-/«9'a-l(l/n)^""'^ (15) 

where ipa-i{a) depends on a and on the distribution pi in a complicated way; details will be presented in a future 
publication. 

Fluctuations and a Universal Law. The fluctuations in the finite-data free energy, /„ = AFn/ksT, as measured 
by the variance cr„ of J^n of (^, are of considerable interest because of their potential to provide parameter-free 
extrapolative estimates of foe — AF/kBT [|l^. Formally, the variance is given by 

dy p„(2/) [log (1 + y/cn'^)f - (/„ - ff . (16) 



Using techniques analogous to those sketched above yields asymptotic expansions for the fluctuations. In the case 
of finite Boltzmann moments, one finds 

{cJn/kBTf « {a/iif/n + O [n-^) , (17) 

where it should be recalled that the unsubscripted moments refer to the density pz- 
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FIG. 2. An exact solution and a universal law. The left panel illustrates the exact solution ( p^ for the analytic form of 
AFn/fcflT when the work Boltzmann factor e"'^'""^"^ is distributed according to a gamma distribution (|l|). The right plot 
illustrates the universal asymptotic behavior of the finite-data free energy difference as a function of its fluctuations, cr^; see 
and text. 



(18) 



Remarkably, comparison with tp^ for (|11|) shows that 

fn- f ={cJn/kBTf/2 + 0{n-'') 
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exactly, as n ^ oo, and independent of any parameters of the distribution. This universal law, valid for the case when 
the second Boltzmann moment is finite, is illustrated in Fig. |^. The gamma distribution of Boltzmann factors was 
Pt{z, 10, 2); see (12). The "regulated power law" distribution is defined by Prp{z) — a/{\-\- and we set a — 2.5. 

Conclusions. Motivated by the need to understand the large-iV asymptotic behavior of free-energy-difference 
estimates based on a finite amount of data [N work values), we have presented a general statistical theory which 
partially completes the task. Two cases were formally identified, distinguished by whether the second moment of 
the distribution of Boltzmann factors of the required work values is finite. The asymptotic behavior was discussed 
for both cases, and — for the finite-second-Boltzmann-moment case — an exact solution and a universal law were 
presented. 

Much remains to be done, both in terms of theory and applications. A question of particular practical interest 
is whether parameter-free extrapolation procedures can be devised, particularly in light of the sensitivity of the 
asymptotic behavior of lS.Fn to the width of the distribution of work values. 
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